knitr::opts_chunk$set(warning=FALSE, message=FALSE)

1 Welcome

Data gathered from https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases

2 Cleaning the data

Here I am doing a few things:

options(stringsAsFactors=FALSE)
library(plyr)
library(dplyr)
library(reshape2)
library(ggplot2)
library(plotly)
library(highcharter)

#function for grabbing status from my filenames
getStatus<- function(x){
  strsplit(x, "-")[[1]] %>%
    last() %>%
    gsub(pattern="\\.csv", replacement="")
}

#function created for adding an active status
createActive <- function(x){
  dcast(Country.Region + Date ~ Status,
        data=x, value.var="Value") %>%
    mutate(Active = Confirmed - (Deaths + Recovered)) %>% 
    melt(id = c("Country.Region", "Date"))
}

#function used to convert a dataset from long to wide
convert <- function(x){ dcast(Country + Date ~ Status,
        data=x, value.var="Value")}

###########################################
### This is where I start cleaning the data
files <- list.files("raw", full.names = TRUE)

raw <- files %>% #read in files
  lapply(function(x){
  read.csv(x) %>% 
    mutate(
      Date = as.Date(Date, "%m/%d/%Y"),
      Status = getStatus(x) #add status
    )
}) %>% 
  bind_rows() %>% #combine files
  subset(!(Value==0) )

raw <- raw %>% #combine countries into one
  group_by(Country.Region, Date, Status) %>% 
  summarise(
    Value = sum(Value)
  )

raw <- raw %>% #get global stats
  group_by(Date, Status) %>% 
  summarise(
    Value = sum(Value)
  ) %>% 
  ungroup() %>% 
  mutate(
    Country.Region = "Global"
  ) %>% 
  bind_rows(raw)

raw <- raw %>% #create active columns, delete nulls, rename
  createActive() %>% 
  subset(!is.na(value)) %>% 
  rename(
   Country = Country.Region,
    Value = value,
    Status = variable
  )

total <- raw %>% #Get total values for separate df
  group_by(Country, Status) %>% 
  summarise(
    Value = max(Value)
  ) %>% 
  ungroup() %>% 
  mutate(
    Status = as.character(Status)
  )

#get change per day
raw <- raw %>%
  group_by(Country, Status) %>% 
  mutate(
    Change = Value - lag(Value, n = 1),
    Rate_Change = (Value - lag(Value, n = 1))/lag(Value, n = 1)
    ) %>% 
  ungroup() %>% 
  mutate(
    Status = as.character(Status)
  )

Now that the data is clean, let’s start digging in!

3 A Quick Analysis

Let’s start by looking at the overall feel of our top 5 Countries.

3.1 Cumulative Total Cases

The above graph shows the differences in cumulative cases by type and by country. We can see some pretty interesting things here. For example, Italy has the highest Deaths and Active cases, yet it is only second in the number of Confirmed cases. This could mean a few things:

  • Italy may not have been testing enough people; broader testing would raise the Confirmed count and bring their apparent Mortality Rate down.

  • Italy’s healthcare system is getting overrun, and their deaths are about to skyrocket.

This graph perfectly shows just how many Deaths Italy has compared to our other four countries. At the time of writing, Italy has just about doubled China’s death count!

For the remainder of this analysis, I will be focusing on Italy.

3.2 Rates of Change

There are a couple of things to keep in mind when looking at the rate of change:

  • It has more variance in the beginning.

  • As the number of cases gets larger, the rate will tend to even out, or even decline.

This is the basic Exponential growth function:

\(f(t) = P_0(1+r)^t\)

The rate is what gets exponentiated in the function, i.e., if our rate is higher, our confirmed cases will climb much faster with each day. Likewise, if the rate is lower, that means our cases per day have dropped from the previous day’s number. If there starts to be a trend of consecutively lower rates of change, the disease is starting to slow down, which may be indicative of being past the peak.
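To see how sensitive exponential growth is to the rate, here is a quick sketch (the \(P_0\) and rate values are made up purely for illustration):

```r
# Exponential growth f(t) = P0 * (1 + r)^t with two made-up rates
P0 <- 100
t  <- 0:10
high <- P0 * (1 + 0.30)^t  # 30% daily growth
low  <- P0 * (1 + 0.15)^t  # 15% daily growth
round(high[11] / low[11], 1)  # after 10 days, the higher rate has ~3.4x the cases
```

Doubling the rate does far more than double the cases, which is why small shifts in the rate of change matter so much.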

This graph clearly shows that the Deaths rate of change not only has a higher average, but most of its rates are higher than the Confirmed rates. This tells us the number of Deaths is going up much faster than our Confirmed cases are.

There are a multitude of reasons why this could happen:

  • The Confirmed cases are lagging behind, due to the incubation period.

  • The Confirmed cases are inaccurate, as finding the true value is difficult.

  • Our Mortality rate is not static,

    • This is a likely scenario for a few reasons:
      • Healthcare systems can be overrun, causing more deaths.
      • Population ages vary depending on location, and we know Covid-19 has a higher Mortality rate among older populations.
      • As well as many more unknown factors.
  • The Rates of change will tend to even out over time, which will really show up in the Changes of the Rates themselves.

    • Meaning in the beginning stages, the rate of change will start with a lot of extreme jumps.
      • For example, if we have 1 Death today, and another tomorrow, for a total of two Deaths, the rate of change from day 1 to day 2 will be 1.0.
      • Also, if the next day has no Deaths reported, the rate of change drops to 0, so the change in that rate per day is -1.0, another extreme jump.
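To make those early-stage jumps concrete, here is a tiny made-up series computed directly (not real data):

```r
deaths <- c(1, 2, 2)      # day 1: one death; day 2: one more; day 3: none
prev <- head(deaths, -1)  # previous day's cumulative totals
curr <- tail(deaths, -1)  # current day's cumulative totals
rate_change <- (curr - prev) / prev
rate_change        # 1.0 then 0.0 -- a violent swing between consecutive days
diff(rate_change)  # -1.0, the change in the rate per day
```

With only a handful of cases, a single report swings the rate of change from 1.0 to 0 in one day.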

This will be important for how we build the Monte Carlo Algorithm.

3.3 Deeper look at Italy’s Death Rate of Change

Looking at this scatter plot, as the number of Deceased cases rises, the Rate of Change trends downwards. This means that Italy has a negative correlation between the actual change in cases per day and the rate of change. This is a good thing, as it could mean a few things:

  • The infection rate is starting to slow down.

  • The rate of change is starting to average out to a lower number.

Let’s find out how well correlated the two actually are, using Pearson’s r.

Before we start, I will remove the outliers from the data, as some of the early rates are inaccurate, for the reasons described before.
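The exact outlier rule used here is hidden in the chunk; one common approach is the 1.5×IQR rule, sketched below on a made-up vector of rates (an assumption on my part, not the document's actual method):

```r
# Hypothetical rates; 0.95 mimics an inaccurate early-stage spike
rates <- c(0.10, 0.12, 0.11, 0.14, 0.13, 0.95)
q <- quantile(rates, c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])  # 1.5 * interquartile range
kept <- rates[rates >= q[1] - fence & rates <= q[2] + fence]
kept  # the 0.95 spike is dropped
```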

Now that that is done, let’s check whether our Rate of Change distribution is relatively normal.

3.4 Shapiro-Wilk Test

Let’s use a Shapiro-Wilk test to check its normality. But first, I will explain what the test is:

  • The null hypothesis of this test is that the population is normally distributed.
    • If the p value is less than the chosen alpha level, the null hypothesis is rejected: there is evidence that the data tested are not normally distributed.
    • Likewise, if the p value is greater than the alpha level, we fail to reject the null hypothesis: the data are consistent with a normal distribution.

  • It is regarded as being the best test for checking the significance of the above null hypothesis;
    • fun fact: this was tested by Monte Carlo Simulations.
## 
##  Shapiro-Wilk normality test
## 
## data:  std_corr$Rate_Change
## W = 0.94503, p-value = 0.1483

Awesome! This tells us that the p value is 0.1483, which is greater than our alpha of 0.05 (a 95% confidence level)!

This is great news, as it means we fail to reject the null hypothesis! Telling us two things:

  • The Rate of Change Distribution is not significantly different from a normal distribution.

  • The W value is 0.945, which is less than 1, but only by 0.05497,

    • This means that the slope of its QQ Plot is relatively close to 1, telling us this data is really close to a normal distribution.

In summary, this means that the Rate of Change, in the Deaths, can be used as a predictor for the Monte Carlo Simulator!

3.5 Pearson’s r and r^2

Let’s take a look into Pearson’s r.

As expected, there is a negative correlation between the Rate of Change and Total cases, as well as a negative correlation between the Rate of Change and the cases per day.

This makes sense: as the number of cases increases, our rate should show fewer variations. This is also great news, as it means our rate is trending downwards, ultimately suggesting that the rate of Deaths per day is slowing down.

That is about all the information we can get out of r, so let’s take a look at r squared!

The r squared value can tell us much more. For example, there is an r^2 value of 0.88 between the Cumulative Deaths and the Cases per day. This means 88% of the variation in Cases per day can be explained by our Total cases, or vice versa! The same goes for all of our relatively high r^2 values, showing a strong correlation between a lot of the different calculations.
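As a quick sanity check, squaring the Deaths~Change correlation from the r matrix shown in the next subsection gives the value quoted above:

```r
r <- 0.9381351  # Deaths ~ Change correlation from the r matrix
round(r^2, 2)   # 0.88
```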

Let’s check whether these values are statistically significant; to do that, we need the test below.

3.6 Testing the Correlations

Null hypothesis:

  • There is not a significant linear relationship between the following categories:
    • Cumulative Deaths and Cases per day,
      • \(r^2=0.88\)
    • The Rate of Change and the Cumulative Deaths,
      • \(r^2=0.279\)
    • The Rate of Change and the Changes in those Rates,
      • \(r^2=0.469\)
  • Therefore, the regression line CANNOT be used to model a linear relationship in the population.

To check whether each r is statistically significant, we convert it into a t statistic using the sample size n:

\(t=\frac{r\sqrt{n-2}}{\sqrt{1-r^2}}\)
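As a check, plugging the Deaths~Change correlation into this formula reproduces the first entry of the matrix below (the sample size n = 29 is inferred from the output, not stated in the text):

```r
r <- 0.9381351  # Deaths ~ Change correlation
n <- 29         # inferred number of observations
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
t_stat  # ~14.08, matching the Deaths/Change entry in the t matrix
```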

##                  Deaths     Change Rate_Change     D_RC_C
## Deaths              Inf 14.0777415    3.230726 0.01932788
## Change      14.07774150        Inf    2.035128 0.98035389
## Rate_Change  3.23072584  2.0351284         Inf 5.15919843
## D_RC_C       0.01932788  0.9803539    5.159198        Inf

These are the test statistics for each r value. An r is statistically significant when its t statistic exceeds the critical t value in magnitude:

\(|t| > t_{\alpha/2,\,n-2}\)
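For reference, the critical value for this comparison can be computed from the t distribution (df = n - 2 = 27, with n inferred from the output):

```r
t_crit <- qt(0.975, df = 27)  # two-sided alpha = 0.05
round(t_crit, 2)              # ~2.05; a t statistic beyond this is significant
```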

Here is a matrix of our r values.

##                   Deaths     Change Rate_Change      D_RC_C
## Deaths       1.000000000  0.9381351  -0.5280149 0.003719627
## Change       0.938135074  1.0000000  -0.3646870 0.185398322
## Rate_Change -0.528014886 -0.3646870   1.0000000 0.704578921
## D_RC_C       0.003719627  0.1853983   0.7045789 1.000000000

This tells us that the three correlations listed in our null hypothesis are statistically significant! Which means that we can use each of those factors as a predictor of the other.

This is very important for building the algorithm to the Monte Carlo Simulator.

3.7 Quick Notes: Infected Cases

There are a few things I want to point out about Covid-19:

  • The true Cumulative Infected Cases would consist of everybody that has at some point contracted Covid-19,
    • including those that are Deceased, Recovered, or Actively sick.
  • We know the median incubation period is 5 days.
    • This means that there is most likely a 5 day lag in our confirmed cases, as half of infected individuals will not show symptoms until the incubation period has passed.
      • This is on top of an inaccurate Cumulative Confirmed Cases count.
  • We know the average days from illness onset to death is 18 days.
    • This can help us estimate the number of infections 23 days prior,
      • as it takes about 5 days for the illness onset to begin after the time of infection.

Using these facts, we can estimate the average time from point of infection to death is 23 days (5 incubation days + 18 days till death). This means that if someone dies today from Covid-19 we can estimate that they got infected 23 days prior. This will help us later on for assessing the total amount of Infected Cases.
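As a rough illustration of why this matters, deaths today can be back-converted into an infection estimate 23 days ago (the 1% mortality rate below is purely an assumed, illustrative figure, not taken from the analysis):

```r
deaths_today <- 6077          # Italy's cumulative deaths on 2020-03-23
assumed_mortality <- 0.01     # hypothetical infection fatality rate
deaths_today / assumed_mortality  # ~607,700 estimated infections 23 days prior
```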

This is the reason it is best to use the Deaths as the base predictor for the Monte Carlo Simulator.

3.8 Sources on facts

https://www.uptodate.com/contents/coronavirus-disease-2019-covid-19?source=history_widget

https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)30566-3/fulltext


4 Building the Monte Carlo Simulation

5 The math behind the Monte Carlo Simulation

\(f(t)=P_0*(1+r)^t\)

We are calculating this one day at a time, which means our t value will be 1, giving us this:

\(f(1)=P_0*(1+r)^1\)

\(f(1)=P_0+P_0*r\)

In plain terms, the day to predict is equal to the rate, times the previous day, plus the previous day. This means, if we can predict the rate (by predicting the change in our rate), we can find our predicted Cumulative Deaths!
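One step of this recursion with made-up numbers:

```r
p0 <- 1000   # previous day's cumulative deaths (made up)
r  <- 0.15   # predicted rate of change (made up)
p0 * r + p0  # next day's predicted cumulative deaths: 1150
```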

5.1 The Algorithm

To start, there are a few things to note about how I built this algorithm:

  • I will build it using the mean and standard deviation from the past 8 days.
    • What this means is, we will only look at the last 8 rates of change to base our next day off of.
    • Once we predict a future rate, we throw away the oldest rate and re-model our predictions!
      • This is to make sure that our model changes in accordance with our predictions, otherwise we would start to see a more linear aspect to our predictions.
  • I have split this algorithm into multiple functions that stop at certain points, in order to explain what’s going on.

Let’s dive in by taking a look at how our rates of change work!

6 Prediction Analysis and building

To start: since we know how much our Deaths change per day, we can work out what percentage that change is of our cumulative Deaths. To do this, we need the following formula:

\(r=\frac{\Delta f(t)}{P_0}\)

In plain terms, this formula takes the change per day of our deaths, and divides it by the previous day’s cumulative Deaths, making the deaths per day into a percentage of our cumulative deaths!
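Checking the formula against the table below: on 2020-03-20 there were 627 new deaths, on top of 3405 cumulative deaths the day before:

```r
deaths_pd <- 627   # deaths per day on 2020-03-20
deaths_n1 <- 3405  # cumulative deaths on the previous day
deaths_pd / deaths_n1  # ~0.184141, the Deaths_RC value for that row
```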

##          Date Deaths Deaths_PD Deaths_RC
## 27 2020-03-19   3405       427 0.1433848
## 28 2020-03-20   4032       627 0.1841410
## 29 2020-03-21   4825       793 0.1966766
## 30 2020-03-22   5476       651 0.1349223
## 31 2020-03-23   6077       601 0.1097516

Now that this step is complete, we can calculate the change in our death rate of change, essentially looking at how much our rates are fluctuating over time. We can do this simply by subtracting the previous day’s rate of change from the current day’s. This means if the rate goes up, the change will be positive!
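Checking against the table below: for 2020-03-20, the change in the rate is that day's rate minus the previous day's rate:

```r
# Rates taken from the Deaths_RC column (rounded, so the last digit
# differs slightly from the unrounded D_RC_C value in the table)
0.1841410 - 0.1433848  # ~0.0407562
```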

##          Date Deaths Deaths_PD Deaths_RC      D_RC_C
## 27 2020-03-19   3405       427 0.1433848 -0.04638745
## 28 2020-03-20   4032       627 0.1841410  0.04075615
## 29 2020-03-21   4825       793 0.1966766  0.01253562
## 30 2020-03-22   5476       651 0.1349223 -0.06175431
## 31 2020-03-23   6077       601 0.1097516 -0.02517064

Great! Here are Italy’s cumulative Deaths, Deaths per day, Deaths rate of change, and the change in the rate of change. We will use all this information to work backwards at the end, to fill in the predictions.

This is the distribution of the change in the rate of change; we want to check that it is relatively normal, otherwise a different method would be needed to make predictions from it.

## 
##  Shapiro-Wilk normality test
## 
## data:  italy$D_RC_C
## W = 0.89634, p-value = 0.2677

Exactly what we wanted to see: our p value is > 0.05, meaning we fail to reject normality, so the distribution is normal enough to draw probabilities from!

6.1 Building the predictions

Now that we have the changes per day, we can build the prediction model. I will break this down into steps:

  • Use a changing mean and standard deviation:

    • Think of this as a revolving door, we only want to look at the past 8 days’ changes.
  • Generate a random number for each day, per trial

  • Calculate backwards to get the Cumulative Deaths

### functions used inside the prediction model below
get_rc <- function(death_rc_n1, change){
  #next rate of change = previous day's rate + predicted change in the rate
  death_rc_n1 + change
}

get_pd <- function(deaths_n1, deaths_rc){
  #deaths per day = previous day's cumulative deaths * rate of change
  ceiling(abs(deaths_rc * deaths_n1))
}

get_death <- function(lag_death, dpd){
  #next cumulative deaths = previous day's cumulative deaths + deaths per day
  ceiling(lag_death + abs(dpd))
}

### function used for prediction model
prediction_model <- function(data, days, trials, split, curr_d){
  for (n in 1:trials){
    temp <- data %>% 
      subset(select = c(Date, D_RC_C))

    for (i in 1:days){
      window <- temp %>% #rolling window: the last `split` rates, predictions included
        slice((nrow(temp) - split + 1):nrow(temp)) %>% 
        pull(D_RC_C)
      
      mu    <- mean(window) #rolling mean
      sigma <- sd(window)   #rolling standard deviation
      
      set.seed(i * 13 * n)  #reproducible draw per day and trial
      temp <- temp %>% 
        bind_rows(data.frame(Date = as.Date(curr_d, format = "%Y-%m-%d") + i,
                             D_RC_C = rnorm(1, mu, sigma))) %>% 
        mutate(
          Trial = n
        )
    }
    
    if (n == 1){
      trial_data <- temp
    } else {
      trial_data <- temp %>% 
        bind_rows(trial_data)
    }
  }
  
  return(trial_data)
}
curr_d <- max(italy[italy$D_RC_C != 0,"Date"])

TRIALS <- 20
DAYS <- 10
SPLIT <- 8

italy <- italy %>% 
  prediction_model(days = DAYS, trials = TRIALS, split = SPLIT, curr_d) %>% 
  group_by(Trial, Date) %>% 
  slice(n()) %>% 
  ungroup() %>% 
  mutate(
    Deaths = NA,
    Deaths_PD = NA,
    Deaths_RC = NA,
    Country = "Italy"
  )

7 Taking a look at our Simulation’s Results!

As shown, these are the 20 simulations for a 10 day forecast. Let’s take a deeper look.

Here we can see the clustering of our Simulations. It seems the farther out the forecast, the less clustered it becomes. This means the Variance grows as we predict further ahead, something that is expected as this uses a Markov technique.

Let’s take a look at the distribution of these individual days!

This shows exactly what is going on with each day that the algorithm predicts. As the number of predicted days increases, the distribution of our Cumulative Deaths flattens out. This means that there is more variation, or ‘less accuracy’, with each day that the algorithm predicts. This tells us it will be more advantageous to use this algorithm as a short term predictor, rather than a long term predictor.

Now let’s take a look at our individual rates!

Now we can see that there is a clear split between rates that are falling and rates that are climbing! This is great news, as it generally implies that the Rate of Change will start to come down, suggesting that what has happened in the past 8 days is directly affecting the rate at which Covid-19 is spreading!

7.1 Running different hyperparameters

One thing that is possible with this algorithm is changing the hyperparameters. I am going to import a dataset of this algorithm run with more trials and different hyperparameters. I want to specifically limit the range to a 6 day forecast.

Here we can see that when we use a revolving mean and standard deviation, the results can vary vastly depending on the chosen window. In the above chart, the variation of the data seems to explode when any window over 10 is chosen. This is most likely because the rates were vastly different in the beginning stages of the spread, when the overall numbers were lower. This suggests that we should be looking at a window of less than 10, for better results.

There is one thing I want to point out: when observing the 30 day window, the median Rate of Change is the lowest of all the options, for every day except the first. This suggests that, although the variance between the trials is higher, the median is still converging to a lower number.

Let’s take a deeper look at the simulations with windows that are less than 10.

It seems that nearly every simulation showed a median change that is negative. This strongly indicates that the Change in Rates is slowing down, meaning that the Rate of Deaths is slowing down overall.

7.2 Comparing the Simulation to Real Life

Let’s take a look at the mean Deaths for the Trials, and compare that with the actual data we now have for Italy!

This chart shows the actual Cumulative deaths for March 24 - 29, as well as the Predicted values. This tells us a few things about the simulations:

  • The closest hyperparameter to match the actual data is the Window of 9 days.
    • Simulating with an error within 200 Deaths per day!
  • Our predictions seem to overall get worse over time
    • Again, this suggests that our model should only be used for short term predictions.

8 Conclusion